Abstract. Familiar statistical tests and estimates are obtained by the 
direct observation of cases of interest: a clinical trial of a new drug, for 
instance, will compare the drug's effects on a relevant set of patients 
and controls. Sometimes, though, indirect evidence may be temptingly 
available, perhaps the results of previous trials on closely related drugs. 
Very roughly speaking, the difference between direct and indirect sta- 
tistical evidence marks the boundary between frequentist and Bayesian 
thinking. Twentieth-century statistical practice focused heavily on di- 
rect evidence, on the grounds of superior objectivity. Now, however, 
new scientific devices such as microarrays routinely produce enormous 
data sets involving thousands of related situations, where indirect ev- 
idence seems too important to ignore. Empirical Bayes methodology 
offers an attractive direct/indirect compromise. There is already some 
evidence of a shift toward a less rigid standard of statistical objectivity 
that allows better use of indirect evidence. This article is basically the 
text of a recent talk featuring some examples from current practice, 
with a little bit of futuristic speculation. 

Key words and phrases: Statistical learning, experience of others, 
Bayesian and frequentist, James-Stein, Benjamini-Hochberg, 
False Discovery Rates, effect size. 



1. INTRODUCTION 

This article is the text of a talk I gave twice in 
2009, at the Objective Bayes Conference at Whar- 
ton, and at the Joint Statistical Meetings in Wash- 
ington, DC. Well, not quite the text. The printed 
page gives me a chance to repair a couple of the more 
gaping omissions in the verbal presentation, without 
violating its rule of avoiding almost all mathemati- 
cal technicalities. 
Basically, however, I'll stick to the text, which was 
a broad-brush view of some recent trends in statis- 
tical applications — their rapidly increasing size and 
complexity — that are impinging on statistical the- 
ory, both frequentist and Bayesian. An OpEd piece 
on "practical philosophy" might be a good descrip- 
tion of what I was aiming for. Most of the talk (as 
I'll refer to this from now on) uses simple exam- 
ples, including some of my old favorites, to get at 
the main ideas. There is no attempt at careful refer- 
encing, just a short list of directly relevant sources 
mentioned at the end. 

I should warn you that the talk is organized more 
historically than logically. It starts with a few exam- 
ples of frequentist, Bayesian and empirical Bayesian 
analysis, all bearing on "indirect evidence," my catch- 
all term for useful information that isn't of obvious 
direct application to a question of interest. This is 
by way of a long build-up to my main point con- 
cerning the torrent of indirect evidence uncorked by 
modern scientific technologies such as the microar- 
ray. It is fair to say that we are living in a new era of 

Fig. 1. Roberto Clemente's batting averages over the 1970 baseball season (partially simulated). After 45 tries he had 18 
hits for a batting average of 18/45 = 0.400; his average in the remainder of the season was 127/367 = 0.346. 



statistical applications, one that is putting pressure 
on traditional Bayesian and frequentist methodolo- 
gies. Toward the end of the talk I'll try to demon- 
strate some of the pitfalls and opportunities of the 
new era, finishing, as the title promises, with a few 
words about the future. 

2. DIRECT STATISTICAL EVIDENCE 

A statistical argument, at least in popular par- 
lance, is one in which many small pieces of evidence, 
often contradictory, are amassed to produce an over- 
all conclusion. A familiar and important example is 
the clinical trial of a promising new drug. We don't 
expect the drug to work on every patient, or for ev- 
ery placebo-receiving patient to fail, but perhaps, 
overall, the new drug will perform "significantly" 
better. 

The clinical trial is collecting direct statistical ev- 
idence, where each bit of data, a patient's success 
or failure, directly bears on the question of interest. 
Direct evidence, interpreted by frequentist methods, 
has been the prevalent mode of statistical appli- 
cation during the past century. It is strongly con- 
nected with the idea of scientific objectivity, which 
accounts, I believe, for the dominance of frequentism 
in scientific reporting. 

Figure 1 concerns an example of direct statisti- 
cal evidence, taken from the sports pages of 1970. 



We are following the star baseball player Roberto 
Clemente through his 1970 season. His batting av- 
erage, number of successes ("hits") over number of 
tries ("at bats") fluctuates wildly at first but set- 
tles down as the season progresses. After 45 tries he 
has 18 hits, for a batting average of 18/45 = 0.400 
or "four hundred" in baseball terminology. The re- 
mainder of the season is slightly less successful, with 
127 hits out of 367 at bats for a batting average of 
0.346 = 127/367, giving Clemente a full season aver- 
age of 0.352.^ This is a classic frequentist estimate: 
direct statistical evidence for Clemente's 1970 bat- 
ting ability. 
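
The arithmetic behind these averages can be checked directly. The following is a minimal Python sketch (the variable names are illustrative, not from the talk) that recomputes the early, late, and full-season averages from the hit and at-bat counts quoted above.

```python
from fractions import Fraction

# Hit and at-bat counts quoted in the text for Clemente's 1970 season.
early = Fraction(18, 45)                    # first 45 at bats
late = Fraction(127, 367)                   # remainder of the season
full = Fraction(18 + 127, 45 + 367)         # whole season: 145 hits in 412 at bats

for label, avg in [("early", early), ("late", late), ("full season", full)]:
    print(f"{label:12s} {float(avg):.3f}")
```

Each average is simply successes over tries, which is what makes it a direct, frequentist estimate.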

In contentious areas such as drug efficacy, the de- 
sire for direct evidence can be overpowering. A clin- 
ical trial often has three arms: placebo, single dose 
of new drug, and double dose. Even if the double 
dose/placebo comparison yields strongly significant 
results in favor of the new drug, a not-quite signifi- 
cant result for the single dose/placebo comparison, 
say p-value 0.07, will not be enough to earn FDA 
approval. The single dose by itself must prove its 
worth. 



^These numbers are accurate, but I have to admit to sim- 
ulating the rest of the figure by randomly dispersing his 18 
hits over the first 45 tries, and similarly for the last 127 hits. 
My own feeling at this point would be that the 
single dose is very likely to be vindicated in any 
subsequent testing. The strong result for the dou- 
ble dose adds indirect evidence to the direct, nearly 
significant, single dose outcome. As the talk's title 
suggests, indirect statistical evidence is the focus of 
interest here. My main point, which will take a while 
to unfold, is that current scientific trends are pro- 
ducing larger and more complex data sets in which 
indirect evidence has to be accounted for: and these 
trends will force some re-thinking of both frequentist 
and Bayesian practices. 

3. BAYESIAN INFERENCE 

I was having coffee with a physicist friend and 
her husband who, thanks to the miracle of sono- 
grams, knew they were due to have twin boys. With- 
out warning, the mother-to-be asked me what was 
the probability her twins would be identical rather 
than fraternal. Stalling for time, I asked if the doc- 
tor had given her any further information. "Yes, he 
said the proportion of identical twins is one-third." 
(I checked later with an epidemiology colleague who 
confirmed this estimate.) 

Thomas Bayes, 18th-century non-conformist En- 
glish minister, would have died in vain if I didn't 
use his rule to answer the physicist mom. In this 
case the prior odds 

$\frac{\Pr\{\text{Identical}\}}{\Pr\{\text{Fraternal}\}} = \frac{1/3}{2/3} = \frac{1}{2}$

favor fraternal. However the likelihood ratio, the 
current evidence from the sonogram, favors identi- 
cal, 

$\frac{\Pr\{\text{Twin boys} \mid \text{Identical}\}}{\Pr\{\text{Twin boys} \mid \text{Fraternal}\}} = \frac{1/2}{1/4} = 2,$

since identical twins are always the same sex while 
fraternal twins are of differing sexes half the time. 

Bayes rule, published posthumously in 1763, is a 
rule for combining evidence from different sources. 
In this case it says that the posterior odds of identi- 
cal to fraternal is obtained by simple multiplication. 

$\text{Posterior odds} = (\text{Prior odds}) \cdot (\text{Likelihood ratio}) = \frac{1}{2} \cdot 2 = 1.$

So my answer to the physicists was "50/50," equal 
chances of identical or fraternal. (This sounded like 
pure guessing to them; I would have gotten a lot 
more respect with "60/40.") 
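
The odds calculation above can be written out in a few lines. This is a minimal sketch, assuming only the one-third prior quoted by the doctor and the sex-matching argument given in the text; the function name `posterior_odds` is hypothetical.

```python
from fractions import Fraction

def posterior_odds(prior_odds, likelihood_ratio):
    """Bayes rule in odds form: posterior odds = prior odds * likelihood ratio."""
    return prior_odds * likelihood_ratio

# Prior odds of identical to fraternal: (1/3) / (2/3).
prior = Fraction(1, 3) / Fraction(2, 3)

# Likelihood ratio for the observation "twin boys": identical twins always
# share a sex (both boys with probability 1/2), while fraternal twins are
# both boys with probability 1/4.
lr = Fraction(1, 2) / Fraction(1, 4)

odds = posterior_odds(prior, lr)
prob_identical = odds / (1 + odds)
print(odds, prob_identical)
```

Working with exact fractions keeps the "50/50" answer free of rounding.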



Bayes rule is a landmark achievement. It was the 
first breakthrough in scientific logic since the Greeks 
and the beginning of statistical inference as a seri- 
ous mathematical subject. From the point of view 
of this talk, it also marked the formal introduction 
of indirect evidence into statistical learning. 

Both Clemente and the physicists are learning from 
experience. Clemente is learning directly from his 
own experience, in a strict frequentist manner. The 
physicists are learning from their own experience 
(the sonogram), but also indirectly from the experi- 
ence of others: that one-third/two-thirds prior odds 
is based on perhaps millions of previous twin births, 
mostly not of the physicists' "twin boys" situation. 
Another way to state Bayes rule is as a device for 
filtering out and using the relevant portions of past 
experiences. 

All statisticians, or almost all of them, enjoy Bayes 
rule, but only a minority make much use of it. Learn- 
ing only from direct experience is a dominant feature 
of contemporary applied statistics, connected, as I 
said, with notions of scientific objectivity. A funda- 
mental Bayesian difficulty is that well-founded prior 
distributions, like the twins one-third/two-thirds, 
are rare in scientific practice. Much of 20th-century 
Bayesian theory concerned subjective prior distribu- 
tions, which are not very convincing in contentious 
areas such as drug trials. 

The holy grail of statistical theory is to use the 
experience of others without the need for subjec- 
tive prior distributions: in L. J. Savage's words, to 
enjoy the Bayesian omelette without breaking the 
Bayesian eggs. I am going to argue that this grail 
has grown holier, and more pressing, in the 21st cen- 
tury. First though I wanted to say something about 
frequentist use of indirect information. 

4. REGRESSION MODELS 

Bayesians have an advantage but not a monopoly 
on the use of indirect evidence. Regression models 
provide an officially sanctioned^ frequentist mecha- 
nism for incorporating the experience of others. 

Figure 2 concerns an example from Dr. Brian My- 
ers' Stanford nephrology laboratory: 157 healthy vol- 
unteers have had their kidney function evaluated by 
a somewhat arduous series of tests. An overall kid- 
ney score, higher numbers better, is plotted versus 



^Sanctioned, though not universally accepted as fully rele- 
vant, as the three-arm drug example showed. 
Fig. 2. Kidney function plotted versus age for 157 healthy volunteers from the nephrology laboratory of Dr. Brian Myers. 
The least squares regression line has a strong downward slope. A new donor age 55 has appeared, and we need to predict his 
kidney score. 



the volunteer's age, illustrating a decline in func- 
tion among the older subjects. (Kidney donation 
was once limited to volunteers less than 60 years 
old.) The decline is emphasized by the downward 
slope of the least squares regression line. 

A potential new donor, age 55, has appeared but 
it is not practical to evaluate his kidney function 
by the arduous testing procedure. How good are his 
kidneys? As far as direct evidence is concerned, only 
one of the 157 volunteers was 55, and he had score 
-0.01. Most statisticians would prefer the estimate 
obtained from the height at age 55 of the least squares 
line, -1.46. In Tukey's evocative language, we are 
"borrowing strength" from the 156 volunteers who 
are not age 55. 

Borrowing strength is a clear use of indirect ev- 
idence, but invoked differently than through Bayes 
theorem. Now every individual is adjusted to fit the 
case of interest; in effect the regression model al- 
lows us to adjust each volunteer to age 55. Linear 
model theory permits a direct frequentist analysis 
of the entire least squares fitting process, but that 
shouldn't conceal the indirect nature of its applica- 
tion to individual cases. 
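
To make "borrowing strength" concrete, here is a minimal sketch on SIMULATED data. The intercept, slope, and noise level below are assumptions chosen only so that the fitted line sits near -1.5 at age 55; they are not the Myers laboratory measurements, and the helper `least_squares` is hypothetical.

```python
import random

def least_squares(xs, ys):
    """Return (intercept, slope) of the ordinary least squares line."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# SIMULATED data (assumption): 157 volunteers aged 20 to 80 whose score
# declines linearly with age, plus noise.
rng = random.Random(0)
ages = [rng.randint(20, 80) for _ in range(157)]
scores = [2.9 - 0.08 * a + rng.gauss(0.0, 1.8) for a in ages]

intercept, slope = least_squares(ages, scores)
prediction_at_55 = intercept + slope * 55   # uses all 157 volunteers

# Direct evidence: only the volunteers who happen to be exactly 55.
direct = [s for a, s in zip(ages, scores) if a == 55]

print(f"slope {slope:.3f}, regression prediction at 55: {prediction_at_55:.2f}")
print(f"{len(direct)} volunteer(s) aged exactly 55: {direct}")
```

The regression prediction draws on every volunteer, while the direct estimate rests on whichever few happen to share the donor's age.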

One response of the statistical community to the 
onslaught of increasingly large and complex data 



sets has been to extend the reach of regression mod- 
els: LARS, lasso, boosting, bagging, CART and pro- 
jection pursuit being a few of the ambitious new 
data-mining algorithms. Every self-respecting sports 
program now has its own simplified data-mining program, 
producing statements like "Jones has only 3 
hits in 16 tries versus Pettitte." This is direct evi- 
dence run amok. Regression models seem to be con- 
sidered beyond the sporting public's sophistication, 
but indirect evidence is everywhere in the sports 
world, as I want to discuss next. 

5. JAMES-STEIN ESTIMATION 

Early in the 1970 baseball season, Carl Morris col- 
lected the batting average data shown in the second 
column of Table 1. Each of the 18 players had bat- 
ted 45 times (they were all of those who had done 
so) with varying degrees of success. Clemente, as 
shown in Figure 1, had hit successfully 18 of the 
45 times, for an observed average of 0.400 = 18/45. 
Near the bottom of the table, Thurman Munson, an- 
other star player, had only 8 hits; observed average 
8/45 = 0.178. The grand average of the 18 players 
at that point was 0.265. 

Only about one-tenth of the season had elapsed, 
and Morris considered predicting each player's subsequent 
batting average during the remainder of 1970. 
Since the players bat independently of each other 
(Clemente's successes don't help Munson, nor vice 
versa), it seems there is no alternative to using the 
observed averages, at least not without employing 
more baseball background knowledge. 

However, that is not true. The James-Stein esti- 
mates in the last column of the table are functions of 
the observed averages, obtained by shrinking them 
a certain amount of the way toward the grand aver- 
age 0.265, as described next. By the end of the 1970 
season, Morris could see the "truth," the players' 
averages over the remainder of the season. If predic- 
tion error is measured by total squared discrepancy 
from the truth, then James-Stein wins handsomely: 
its total squared prediction error was less than one- 
third of that for the observed averages. This wasn't 
a matter of luck, as we will see. 

Suppose each player has a true expectation $\mu_i$ and 
an observed average $x_i$, following the model 

(1)   $\mu_i \sim \mathcal{N}(M, A)$ and $x_i \mid \mu_i \sim \mathcal{N}(\mu_i, \sigma_0^2)$

for $i = 1, 2, \ldots, N = 18$. Here $M$ and $A$ are mean 
and variance hyper-parameters that determine the 
Bayesian prior distribution; $\mu_i$ can be thought of as 
the "truth" in Table 1, $x_i$ as the observed average, 
and $\sigma_0^2$ as its approximate binomial variance 
$0.265 \cdot (1 - 0.265)/45$. (I won't worry about the fact that 
$x_i$ is binomial rather than perfectly normal.) 

Table 1
Batting averages for 18 major league players early in the 1970 season ("Observed") and their averages for the 
remainder of the season ("Truth"). Also the James-Stein predictions 

Name               Hits/AB   Observed   "Truth"   James-Stein
1. Clemente        18/45     0.400      0.346     0.294
2. F. Robinson     17/45     0.378      0.298     0.289
3. F. Howard       16/45     0.356      0.276     0.285
4. Johnstone       15/45     0.333      0.222     0.280
14. Petrocelli     10/45     0.222      0.264     0.256
15. E. Rodriguez   10/45     0.222      0.226     0.256
16. Campaneris     9/45      0.200      0.286     0.252
17. Munson         8/45      0.178      0.316     0.247
18. Alvis          7/45      0.156      0.200     0.242
Grand average                0.265      0.265     0.265

The posterior expectation of $\mu_i$ given $x_i$, which is 
the Bayes estimator under squared error loss, is 

(2)   $\hat{\mu}_i^{(\mathrm{Bayes})} = M + B(x_i - M)$ where $B = \frac{A}{A + \sigma_0^2}$.

If $A = \sigma_0^2$, for example, Bayes rule shrinks each observed 
average $x_i$ half way toward the prior mean 
$M$. Using Bayes rule reduces the total squared error 
of prediction, compared to using the obvious estimates 
$x_i$, by a factor of $1 - B$. This is a 50% savings 
if $A = \sigma_0^2$, and more if the prior variance $A$ is less 
than $\sigma_0^2$. 
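
The claimed savings can be checked by simulation. The sketch below assumes model (1) holds exactly, takes the shrinkage factor as B = A/(A + sigma0^2) (which gives the half-way shrinkage described above when A equals sigma0^2), and uses a large hypothetical N so that the averages settle; none of these simulated values are baseball data.

```python
import random

def total_sq_error(estimates, truths):
    """Total squared discrepancy between estimates and the truth."""
    return sum((e - t) ** 2 for e, t in zip(estimates, truths))

# M and sigma0^2 come from the text; setting A equal to sigma0^2 and using
# N = 20000 simulated "players" are assumptions made for illustration.
M = 0.265
sigma0_sq = 0.265 * (1 - 0.265) / 45
A = sigma0_sq
B = A / (A + sigma0_sq)                     # shrinkage factor, here 1/2

rng = random.Random(1)
N = 20000
mu = [rng.gauss(M, A ** 0.5) for _ in range(N)]           # simulated "truth"
x = [rng.gauss(m, sigma0_sq ** 0.5) for m in mu]          # observed averages
bayes = [M + B * (xi - M) for xi in x]                    # Bayes estimates

ratio = total_sq_error(bayes, mu) / total_sq_error(x, mu)
print(f"B = {B:.2f}, squared-error ratio Bayes/observed = {ratio:.3f}")
```

With A equal to sigma0^2 the ratio should come out near one half, that is, a saving of about 1 - B = 50%.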

Baseball experts might know accurate values for 
$M$ and $A$, or $M$ and $B$, but we are not assuming 
expert prior knowledge here. The James-Stein estimator 
can be motivated quite simply: unbiased estimates 
$\hat{M}$ and $\hat{B}$ are obtained from the vector of 
observations $x = (x_1, x_2, \ldots, x_N)$ (e.g., $\hat{M} = \bar{x}$, the 
grand average) and substituted into formula (2). In 
Herbert Robbins' apt terminology, James-Stein is 
an empirical Bayes estimator. It doesn't perform 
as well as the actual Bayes estimate (2), but under 
model (1) the penalty is surprisingly small. 
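
One way to see the plug-in idea at work is to back out the shrinkage factor that the printed James-Stein column implies. The sketch below uses only the nine rows of Table 1 that appear in the text and assumes each prediction has the form grand average + B * (observed - grand average); the recovered values are approximate because the table is rounded to three decimals.

```python
# Rows of Table 1 as printed: (name, observed average, James-Stein prediction).
rows = [
    ("Clemente", 0.400, 0.294),
    ("F. Robinson", 0.378, 0.289),
    ("F. Howard", 0.356, 0.285),
    ("Johnstone", 0.333, 0.280),
    ("Petrocelli", 0.222, 0.256),
    ("E. Rodriguez", 0.222, 0.256),
    ("Campaneris", 0.200, 0.252),
    ("Munson", 0.178, 0.247),
    ("Alvis", 0.156, 0.242),
]
grand_average = 0.265

# Solve prediction = grand_average + B * (observed - grand_average) for B,
# row by row.
implied_B = {
    name: (js - grand_average) / (obs - grand_average) for name, obs, js in rows
}
for name, b in implied_B.items():
    print(f"{name:13s} implied shrinkage factor {b:.3f}")
```

Every row implies a factor of roughly 0.2, so each prediction sits only about one-fifth of the way from the grand average toward the player's observed average.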

All of this seems interesting enough, but a skeptic 
might ask where the normal prior distributions 
$\mu_i \sim \mathcal{N}(M, A)$ in (1) are coming from. In fact, James and 
Stein didn't use normal priors, or any priors at all, in 
their derivation. Instead they proved the following 
frequentist theorem. 

Theorem 1 (1956). If $x_i \sim \mathcal{N}(\mu_i, \sigma_0^2)$ independently 
for $i = 1, 2, \ldots, N$, $N > 4$, then the James-Stein 
estimator always beats the obvious estimator 
$x_i$ in terms of expected total squared estimation error. 
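
The theorem makes no use of a prior, and a small simulation with fixed means illustrates it. The sketch below assumes the common form of the James-Stein rule that shrinks toward the grand average with estimated factor 1 - (N - 3) * sigma0^2 / S, where S is the sum of squared deviations from the grand average; the ten "true" means are hypothetical values picked by hand.

```python
import random

def james_stein(x, sigma0_sq):
    """Shrink each x[i] toward the grand average (no positive-part correction)."""
    n = len(x)
    xbar = sum(x) / n
    s = sum((xi - xbar) ** 2 for xi in x)
    b_hat = 1.0 - (n - 3) * sigma0_sq / s      # estimated shrinkage factor
    return [xbar + b_hat * (xi - xbar) for xi in x]

def sq_err(est, truth):
    """Total squared estimation error."""
    return sum((e - t) ** 2 for e, t in zip(est, truth))

# Hypothetical fixed means (assumption); they are not drawn from any prior.
mu = [-2.0, -1.0, -0.5, 0.0, 0.0, 0.3, 0.8, 1.5, 2.5, 4.0]
sigma0_sq = 1.0

rng = random.Random(2)
reps = 5000
err_obvious = err_js = 0.0
for _ in range(reps):
    x = [rng.gauss(m, sigma0_sq ** 0.5) for m in mu]
    err_obvious += sq_err(x, mu)
    err_js += sq_err(james_stein(x, sigma0_sq), mu)

print(f"average total squared error: obvious {err_obvious / reps:.2f}, "
      f"James-Stein {err_js / reps:.2f}")
```

The comparison is of totals: individual coordinates can still do worse under shrinkage even when the total improves.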

This is the single most striking result of post-World War II 
statistical theory. It is sometimes called^ 
Stein's paradox for it says that Clemente's good performance 
does increase our estimate for Munson (e.g., 
by increasing $\hat{M} = \bar{x}$) and vice versa, even though 
they succeed or fail independently. In addition to 
the direct evidence of each player's batting average, 
we gain indirect evidence from the other 17 averages. 

James-Stein estimation is not an unmitigated bless- 
ing. Low total squared error can conceal poor per- 
formance on genuinely unusual cases. Baseball fans 
know from past experience that Clemente was an 
unusually good hitter, who is learning too much 



^Willard James was Charles Stein's graduate student. Stein 
had shown earlier that another, less well-motivated, estimator 
dominated the obvious rule. 

from the experience of others by being included in 
a cohort of less-talented players. I'll call this the 
Clemente problem in what follows. 

6. LARGE-SCALE MULTIPLE INFERENCE 

All of this is a preface, and one that could have 
been written 50 years ago, to what I am really in- 
terested in talking about here. Large-scale multiple 
inference, in which thousands of statistical problems 
are considered at once, has become a fact of life for 
21st-century statisticians. There is just too much in- 
direct evidence to ignore in such situations. Coming 
to grips with our new, more intense, scientific en- 
vironment is a major enterprise for the statistical 
community, and one that is already affecting both 
theory and practice. 

Rupert Miller's book Simultaneous Statistical In- 
ference appeared in 1966, lucidly summarizing the 
post-war boom in multiple-testing theory. The book 
is overwhelmingly frequentist, aimed mainly at the 
control of type I error, and concerned with the simul- 
taneous analysis of between 2 and perhaps 10 testing 
problems. Microarray technology introduced in the 
1990s dramatically raised the ante: number of prob- 
lems now easily exceeds 10,000; "SNP chips" have 
N = 500,000+, and imaging devices reach higher 
still. 

Figure 3 concerns a microarray study in which the 
researchers were on a fishing expedition to find genes 
involved in the development of prostate cancer: 102 
men, 50 healthy controls and 52 prostate cancer pa- 
tients, each had expression levels for N = 6033 genes 
measured on microarrays. The resulting data ma- 
trix had N = 6033 rows, one for each gene, and 102 
columns, one for each man. 

As a first step in looking for "interesting" genes, a 
two-sample t-statistic $t_i$ comparing cancer patients 
with controls was computed for each gene $i$, $i = 1, 2, \ldots, N$, 
and then converted to a z-value 

(3)   $z_i = \Phi^{-1}(F_{100}(t_i))$

with $\Phi$ and $F_{100}$ the c.d.f.'s of a standard normal 
and $t_{100}$ variate. Under the usual textbook conditions, 
$z_i$ will have a standard normal distribution in 
the null (uninteresting) situation where genetic expression 
levels are identically distributed for controls 
and patients, 

(4)   $H_0: z_i \sim \mathcal{N}(0, 1)$.

A histogram of the $N = 6033$ z-values appears in 
Figure 3. It is fit reasonably well by the "theoret- 
ical null" curve that would apply if all the genes 



followed (4), except that there is an excess of tail 
values, which might indicate some interesting "non- 
null" genes responding differently in cancer and con- 
trol subjects. 

Here I will concentrate on the 49 genes having 
$z_i$ exceeding 3.0, as indicated by the hash marks. 
Figure 4 shows a close-up of the right tail, where 
we notice that 49 is much greater than 8.14, the 
expected number of $z_i$'s exceeding 3.0 under full null 
conditions. The ratio is 

(5)   $\widehat{\mathrm{Fdr}}(3.0) = \frac{8.14}{49} \approx \frac{1}{6},$

where Fdr stands for false discovery rate, in Benjamini 
and Hochberg's evocative terminology. Reporting 
the list of 49 back to the investigators seems 
like a good bet if it only contains 1/6 duds, but can 
we believe that value? 
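
The figures 8.14 and 1/6 follow from the standard normal tail area. A minimal sketch, assuming every one of the 6033 genes follows the theoretical null and using Python's standard-library `statistics.NormalDist` (variable names are illustrative):

```python
from statistics import NormalDist

N = 6033          # genes in the prostate study
c = 3.0           # cutoff
n_exceed = 49     # observed z-values above the cutoff

# Expected number above c if every gene followed the theoretical null N(0, 1).
tail_prob = 1.0 - NormalDist().cdf(c)
expected_null = N * tail_prob

fdr_hat = expected_null / n_exceed
print(f"E0({c}) = {expected_null:.2f}, estimated Fdr = {fdr_hat:.3f}")
```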

Benjamini and Hochberg's (1995) paper answered 
the question with what I consider the second most 
striking theorem of post-war statistics. For any given 
cutoff point $c$ let $N(c)$ be the number of $z_i$'s observed 
to exceed $c$, $E_0(c)$ the expected number exceeding $c$ 
if all genes are null (4), and 

(6)   $\widehat{\mathrm{Fdr}}(c) = E_0(c)/N(c)$.

[In (5), $c = 3.0$, $N(c) = 49$, and $E_0(c) = 8.14$.] Choose 
an Fdr control value $q$ between 0 and 1 and let $c_q$ 
be the smallest value of $c$ such that $\widehat{\mathrm{Fdr}}(c) < q$. 

Theorem 2. If the $N$ z-values are independent 
of each other, then the rule that rejects the null hypothesis 
(4) for all cases having $z_i > c_q$ will make the 
expected proportion of false discoveries no greater 
than $q$. 

In the prostate data example, choosing $q = 1/6$ 
gives $c_q = 3.0$ and yields a list of 49 presumably interesting 
genes. Assuming independence^, the theorem 
says that the expected proportion of actual 
null cases on the list is no greater than 1/6. That is 
a frequentist expectation, Benjamini and Hochberg 
like James and Stein having worked frequentisti- 
cally, but once again there is an instructive Bayesian 
interpretation. 
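
The rule in Theorem 2 can be sketched as a short function. This is an illustration under stated assumptions, not the authors' code: it works on the right tail only, takes the theoretical null, assumes no tied z-values, and uses "less than or equal to q" as in the usual step-up statement (for continuous z-values the difference from a strict inequality does not matter). The function name `bh_right_tail` and the ten z-values are hypothetical.

```python
from statistics import NormalDist

def bh_right_tail(z, q, p0=1.0):
    """Right-tail Benjamini-Hochberg rule based on Fdr_hat(c) = E0(c) / N(c).

    Returns (cutoff, indices of rejected cases); cutoff is None if nothing
    is rejected. Assumes no tied z-values.
    """
    n = len(z)
    phi = NormalDist()
    order = sorted(range(n), key=lambda i: z[i], reverse=True)  # largest first
    n_reject = 0
    for rank, i in enumerate(order, start=1):
        expected_null = p0 * n * (1.0 - phi.cdf(z[i]))   # E0(c) at c = z[i]
        if expected_null / rank <= q:                     # Fdr_hat(c) <= q
            n_reject = rank                               # keep the smallest such c
    if n_reject == 0:
        return None, []
    rejected = order[:n_reject]
    return z[rejected[-1]], rejected

# Hypothetical z-values for illustration only (not the prostate data).
z = [0.3, -1.2, 4.1, 0.8, 3.6, -0.4, 1.5, 3.9, 0.1, -2.0]
cutoff, rejected = bh_right_tail(z, q=0.10)
print(cutoff, sorted(rejected))
```

Note that whether a given case is rejected depends on how many other cases exceed it, which is where the indirect evidence enters.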

A very simple Bayes model for simultaneous hypothesis 
testing, the two-groups model, assumes that 
each gene has prior probability $p_0$ or $p_1 = 1 - p_0$ of 



^This isn't a bad assumption for the prostate data, but a 
dangerous one in general for microarray experiments. How- 
ever, dependence usually has little effect on the theorem's 
conclusion. A more common choice of q is 0.10. 

Fig. 3. Histogram of N = 6033 z-values from the prostate cancer study compared with the theoretical null density that would 
apply if all the genes were uninteresting. Hash marks indicate the 49 z-values exceeding 3.0. 

Fig. 4. Close-up of right tail of the prostate data z-value histogram; 49 $z_i$'s exceed 3.0, compared to an expected number 8.14 
if all genes were null (4). 

being null or non-null, with corresponding z-value 
density $f_0(z)$ or $f_1(z)$: 

(7)   Prior probability $\begin{cases} p_0: & z_i \sim f_0(z), \\ p_1: & z_i \sim f_1(z). \end{cases}$



Let $F_0(z)$ and $F_1(z)$ be the right-sided c.d.f.'s (survival 
functions) corresponding to $f_0$ and $f_1$, and 
$F(z)$ their mixture, 

(8)   $F(z) = p_0 F_0(z) + p_1 F_1(z)$.

Applying Bayes theorem shows that the true false 
discovery rate is 

(9)   $\mathrm{Fdr}(c) = \Pr\{\text{gene } i \text{ null} \mid z_i > c\} = p_0 F_0(c)/F(c)$.



(Left-sided c.d.f.'s perform just as well, but it is con- 
venient to work on the right here.) 

Of course we can't apply the Bayesian result (9) 
unless we know $p_0$, $f_0$, and $f_1$ in (7). Once again 
though, a simple empirical Bayes estimate is available. 
Under the theoretical null (4), $F_0(z) = 1 - \Phi(z)$, 
the standard normal right-sided c.d.f.; $p_0$ will 
usually be close to 1 in fishing expedition situations 
and has little effect on Fdr(c). (Benjamini and 
Hochberg set $p_0 = 1$. It can be estimated from the 
data, and I will take it as known here.) That leaves 
the mixture c.d.f. $F(z)$ as the only unknown. But by 
definition, all $N$ $z_i$ values follow $F(z)$, so we can estimate 
it by the empirical c.d.f. $\hat{F}(z) = \#\{z_i > z\}/N$, 
leading to the empirical Bayes estimate of (9), 

(10)   $\widehat{\mathrm{Fdr}}(c) = p_0 F_0(c)/\hat{F}(c)$.



The two definitions of $\widehat{\mathrm{Fdr}}(c)$, (6) and (10), are the 
same since $E_0(c) = N p_0 F_0(c)$ and $\hat{F}(c) = N(c)/N$. 
This means we can restate Benjamini and Hochberg's 
theorem in empirical Bayes terms: the list of cases 
reported by BH($q$), the Benjamini-Hochberg level-$q$ 
rule, is essentially those cases having estimated posterior 
probability of being null no greater than $q$. 

The Benjamini-Hochberg algorithm clearly 
involves indirect evidence. In this case, each $z_i$ value 
is learning from the other $N - 1$ values: if, say, only 
10 instead of 49 z-values had exceeded 3.0, then 
$\widehat{\mathrm{Fdr}}(c)$ would equal 0.81 (i.e., "very likely null") so 
a gene with $z_i > 3.0$ would now not be reported as 
non-null. 
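
Both points in the last two paragraphs, the agreement of the two definitions and the dependence of one gene's verdict on the other genes' z-values, can be seen numerically. A minimal sketch for the prostate setting at cutoff 3.0, taking the prior null probability as 1 as Benjamini and Hochberg do (function names are hypothetical):

```python
from statistics import NormalDist

N, c, p0 = 6033, 3.0, 1.0
F0_c = 1.0 - NormalDist().cdf(c)      # theoretical-null right-tail probability

def fdr_hat(n_exceed):
    """Empirical Bayes form: p0 * F0(c) / F_hat(c), with F_hat(c) = n_exceed / N."""
    return p0 * F0_c / (n_exceed / N)

def fdr_hat_counts(n_exceed):
    """Counting form: E0(c) / N(c), with E0(c) = N * p0 * F0(c)."""
    return (N * p0 * F0_c) / n_exceed

for n_exceed in (49, 10):
    print(n_exceed, round(fdr_hat(n_exceed), 2), round(fdr_hat_counts(n_exceed), 2))
```

With 49 exceedances the estimate is about 0.17; with only 10 it rises to about 0.81, even though the tail area at 3.0 is unchanged.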

I have been pleasantly surprised at how quickly 
false discovery rate control was accepted by statis- 
ticians and our clients. It is fundamentally different 



from type I error control, the standard for nearly 
a century, in its Bayesian aspect, its use of indirect 
evidence, and in the fact that it provides an explicit 
estimate of nullness $\widehat{\mathrm{Fdr}}(z)$ rather than just a yes/no 
decision.^ 

7. THE PROPER USE OF INDIRECT 
EVIDENCE 

The false discovery rate story is a promising sign 
of our profession's ability to embrace new methods 
for new problems. However, in moving beyond the 
confines of classical statistics we are also moving 
outside the wall of protection that a century of the- 
ory and experience has erected against inferential 
error. 

Within its proper venue, it is hard to go very 
wrong with a frequentist analysis of direct evidence. 
I find it quite easy to go wrong in large-scale data 
analyses. This section and the next offer a couple of 
examples of the pitfalls yawning in the use of indi- 
rect evidence. None of this is meant to be discourag- 
ing: difficulties are what researchers thrive on, and 
I fully expect statisticians to successfully navigate 
these new waters. 

The results of another microarray experiment, this 
time concerning leukemia, are summarized in Figure 5. 
High-density oligonucleotide microarrays provided 
expression levels on $N = 7128$ genes for 72 patients, 
45 with ALL (acute lymphoblastic leukemia) 
and 27 with AML (acute myeloid leukemia), the lat- 
ter having worse prognosis. Two-sample t-statistics 
provided z-values Zi for each gene, as with the prostate 
study. 

Figure 5 shows that this time the center of the 
z-value histogram does not approximate a $\mathcal{N}(0, 1)$ 
density. Instead, it is much too wide: a maximum 
likelihood fit to central histogram heights gave estimated 
proportion $\hat{p}_0 = 0.93$ of null genes in the 
two-groups model (7), and an empirical null density 
estimate 

(11)   $\hat{f}_0(z) \sim \mathcal{N}(0.09, 1.68^2)$,

more than half again as wide as the $\mathcal{N}(0, 1)$ theoretical 
null (4). The dashed curve shows (11) nicely 
following the histogram height near the center while 
the estimated proportion of non-null genes $\hat{p}_1 = 1 - \hat{p}_0 = 0.07$ 
appear as heavy tails, noticeably on the 
left. 



^Although one might consider p-values to provide such estimates 
in classical testing. 

At this point one could maintain faith in the theoretical 
null but at the expense of concluding that 
about 2500 (35%) of the genes are involved in 
AML/ALL differences. On the other hand, there 
are plenty of reasons to doubt the theoretical null. 
In particular, the leukemia data comes from an ob- 
servational study, not a randomized experiment, so 
that unobserved covariates (age, sex, health status, 
race, etc.) could easily add a component of variance 
to both the null and non-null z-values. 

The crucial question here has to do with the numerator 
$E_0(c)$ in $\widehat{\mathrm{Fdr}}(c) = E_0(c)/N(c)$, the expected 
number of null cases exceeding $c$. The theoretical 
$\mathcal{N}(0, 1)$ null predicts many fewer of these than does 
the empirical null (11). The fact that we might estimate 
the appropriate null distribution from evidence 
at hand — bordering on heresy from the point of view 
of classical testing theory — shows the opportunities 
inherent in large-scale studies, as well as the novel 
inferential questions surrounding the use of indirect 
evidence. 
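
The size of the gap between the two nulls is easy to quantify. The sketch below compares the expected number of null cases exceeding a cutoff under the theoretical null and under the empirical null with mean 0.09 and standard deviation 1.68. The cutoff c = 3.0 is an illustrative choice, not one taken from the leukemia analysis, and applying the estimated null proportion 0.93 to both nulls is a simplification.

```python
from statistics import NormalDist

N = 7128                                   # genes in the leukemia study
p0 = 0.93                                  # estimated proportion of null genes
theoretical = NormalDist(0.0, 1.0)         # theoretical null
empirical = NormalDist(0.09, 1.68)         # empirical null estimate

c = 3.0   # illustrative cutoff (assumption)
e0_theoretical = N * p0 * (1.0 - theoretical.cdf(c))
e0_empirical = N * p0 * (1.0 - empirical.cdf(c))
print(f"expected nulls above {c}: theoretical {e0_theoretical:.1f}, "
      f"empirical {e0_empirical:.1f}")
```

At this cutoff the empirical null expects on the order of thirty times as many null cases in the tail, so the same list of genes would receive a far larger estimated Fdr.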

8. RELEVANCE 

Large-scale testing algorithms are usually carried 
out under the tacit assumption that all available 
cases should be analyzed together: for instance, em- 
ploying a single false discovery analysis for all the 
genes in a given microarray experiment. This can be 
a dangerous assumption, as the example illustrated 
in Figure 6 will show. 

Twelve children, six dyslexics and six normal con- 
trols, received DTI (diffusion tensor imaging) scans, 
measuring fluid diffusion at $N$ = 15,443 locations 
(voxels) in the brain. A z-value $z_i$ was computed at 
each voxel such that the theoretical null hypothesis 
$z_i \sim \mathcal{N}(0, 1)$ should apply to locations where there 
is no dyslexic/normal distributional difference. The 
goal of course was to pinpoint areas of genuine dif- 
ference. 

Figure 6 indicates the z-values in a horizontal slice 
of the brain about half-way from bottom to top. 
Open circles, colored red, indicate $z_i > 0$, solid red 
circles $z_i > 2$; green + symbols indicate $z_i < 0$, with 
green ^ for $z_i < -2$. The x-axis measures distance 
from the back of the brain to the front, left to right. 

Spatial correlation among the $z_i$'s is evident: red 
circles are near red circles and green +'s near other 
green +'s. The Benjamini-Hochberg Fdr control al- 
gorithm tends to perform as claimed as an hypothesis- 
testing device, even under substantial correlation. 



However, there is an empirical Bayes price to pay: 
correlation makes $\widehat{\mathrm{Fdr}}(c)$ (10) less dependable as an 
estimate of the true Bayes probability (9). Just how 
much less is a matter of current study. 

There is something else to worry about in Figure 
6: the front half of the brain, x > 50, seems to be 
redder (i.e., with more positive z-values) than the 
back half. This is confirmed by the superimposed 
histograms for the two halves, about 7700 voxels 
each, seen in Figure 7. Separate Fdr tests at control 
level q = 0.10 yield 281 "significant" voxels for the 
front-half data, all those with $z_i > 2.69$, and none 
at all for the back half. But if we analyze all 15,443 
voxels at once, the Fdr test yields only 198 significant 
voxels, those having $z_i > 3.02$. Which analysis 
is correct? 

This is the kind of question my warning about 
difficult new inference problems was aimed at. No- 
tice that the two histograms differ near their centers 
as well as in the tails. The Fdr analyses employed 
theoretical $\mathcal{N}(0, 1)$ null distributions. Using empirical 
nulls as with the leukemia data gives quite different 
null distributions, raising further questions about 
proper comparisons. 

The front/back division of the brain was arbitrary 
and not founded on any scientific criteria. Figure 8 
shows all 15,443 $z_i$'s plotted against $x_i$, the voxel's 
distance from the back. We see waves in the z-values, 
at the lower percentiles as well as at the top, crest- 
ing near x = 64. Disturbingly, most of the 281 sig- 
nificant voxels for the front-half analysis came from 
this crest. 

Maybe I should be doing local Fdr tests of some 
sort, or perhaps making regression adjustments (e.g., 
subtracting off the running median) before applying 
an Fdr procedure. We have returned to a version of 
the Clemente problem: which are the relevant vox- 
els for deciding whether or not any given voxel is 
responding differently in dyslexics and controls? In 
other words, where is the relevant indirect informa- 
tion? 
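
One of the adjustments mentioned above, subtracting off a running median, can be sketched as follows. This is only an assumed version of the idea: the window half-width, the function name `subtract_running_median`, and the simulated wave are hypothetical, and nothing here is tuned to the brain data.

```python
import numpy as np

def subtract_running_median(x, z, half_width=2.0):
    """Return z_i minus the median of all z_j with |x_j - x_i| <= half_width."""
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    order = np.argsort(x)
    xs, zs = x[order], z[order]
    # Index range of the window around each sorted x value.
    lo = np.searchsorted(xs, xs - half_width, side="left")
    hi = np.searchsorted(xs, xs + half_width, side="right")
    med = np.array([np.median(zs[a:b]) for a, b in zip(lo, hi)])
    adjusted = np.empty_like(z)
    adjusted[order] = zs - med                 # put results back in input order
    return adjusted

# Simulated z-values with a smooth wave in x added to N(0, 1) noise.
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 100.0, 5000)
wave = 0.5 * np.sin(x / 10.0)
z = wave + rng.normal(0.0, 1.0, 5000)
z_adj = subtract_running_median(x, z, half_width=5.0)

corr_before = np.corrcoef(wave, z)[0, 1]
corr_after = np.corrcoef(wave, z_adj)[0, 1]
print(corr_before, corr_after)
```

The adjusted values could then be passed to an Fdr procedure. Choosing the window is itself a decision about which voxels count as relevant neighbors, so the sketch restates the question rather than answering it.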

9. THE NORMAL HIERARCHICAL MODEL 

My final example of indirect evidence and empir- 
ical Bayes inference concerns the normal hierarchi- 
cal model. This is a simple but important Bayesian 
model where $\mu$, a parameter of interest, comes from 
some prior density $g(\cdot)$ and we get to observe a normal 
variate $z$ centered at $\mu$, 

(12)    $\mu \sim g(\cdot)$ and $z \mid \mu \sim \mathcal{N}(\mu, 1)$. 

Fig. 8. z-values for the 15,443 voxels plotted versus their distance from the back of the brain. A disturbing wave pattern is 
evident, cresting near x = 64. Most of the 281 significant voxels in Figure 7 come from this crest. 

Both the James-Stein and Benjamini-Hochberg estimators 
can be motivated from (12), 

(13)    JS: $g = \mathcal{N}(M, A)$ and BH: $g = p_0 \delta_0 + p_1 g_1$. 

In the latter, $\delta_0$ is a delta function at 0 while $g_1$ is 
an arbitrary density giving $f_1$ in (7) by convolution, 
$f_1 = g_1 * \varphi$ where $\varphi$ is the standard normal density. 

In the BH setting, we might call $\mu_i$ (the value of $\mu$ 
for the $i$th case) the effect size. For prediction purposes, 
we want to identify cases not only with $\mu_i \neq 0$ 
but with large effect size. A very useful property of 
the normal hierarchical model (12) allows us to calculate 
the Bayes estimate of effect size directly from 
the convolution density $f = g * \varphi$ without having to 
calculate $g$, 

(14)    $f(z) = \int_{-\infty}^{\infty} \varphi(z - \mu)\, g(\mu)\, d\mu$. 

Lemma 1. Under the normal hierarchical model 
(12), 

(15)    $E\{\mu \mid z\} = z + f'(z)/f(z)$, 

where $f'(z) = df(z)/dz$. 
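
As a quick check of the lemma, consider the James-Stein prior from (13). The worked lines below are a sketch under that single assumption. They show that the lemma reproduces the familiar linear shrinkage of $z$ toward $M$.

```latex
% Assumption: g = N(M, A), the JS prior in (13); \varphi is the N(0,1) density.
\begin{align*}
f(z) &= (g * \varphi)(z) \quad\Longrightarrow\quad z \sim \mathcal{N}(M,\, A + 1), \\
\frac{f'(z)}{f(z)} &= \frac{d}{dz}\log f(z) = -\frac{z - M}{A + 1}, \\
E\{\mu \mid z\} &= z - \frac{z - M}{A + 1} = M + \frac{A}{A + 1}\,(z - M).
\end{align*}
```

The correction term $f'(z)/f(z)$ is the slope of the log marginal density, so any prior that makes $f$ fall away from its center pulls the estimate back toward that center.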

The marginal density of $z$ in model (12) is $f(z)$. 
So if we observe $\mathbf{z} = (z_1, z_2, \ldots, z_N)$ from repeated 
realizations of $(\mu_i, z_i)$, we can fit a smooth density 
estimate $\hat f(z)$ to the $z_i$'s and use the lemma to approximate 
$E\{\mu_i \mid z_i\}$, 

(16)    $\mathbf{z} \rightarrow \hat f(z) \rightarrow \hat E\{\mu_i \mid z_i\} = z_i + \hat f'(z_i)/\hat f(z_i)$. 

This has been done in Figure 9 for the prostate data 
of Figure 3, with $\hat f(z)$ a natural spline, fit with 7 
degrees of freedom to the heights of Figure 3's his- 
togram bars (all of them, not just the central ones 
we used to estimate empirical nulls). 
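
The recipe in (16) can be sketched in a few lines. The version below is an assumed stand-in, not the analysis behind Figure 9: it replaces the natural spline with a degree-5 polynomial for $\log f$, fits it to histogram counts by Poisson regression, and runs on simulated data where the right answer is known. The function name `effect_size_estimates` and all tuning values are hypothetical.

```python
import numpy as np

def effect_size_estimates(z, z_eval=None, n_bins=60, degree=5, n_iter=50):
    """Approximate E{mu | z} = z + f'(z)/f(z) using only the observed z-values.

    log f is modelled as a polynomial (an assumption of this sketch, standing in
    for the natural spline) fitted to histogram counts by Poisson regression.
    """
    z = np.asarray(z, dtype=float)
    z_eval = z if z_eval is None else np.asarray(z_eval, dtype=float)

    counts, edges = np.histogram(z, bins=n_bins)
    mids = 0.5 * (edges[:-1] + edges[1:])

    # Standardise bin centres so the polynomial design matrix is well conditioned.
    loc, scale = mids.mean(), mids.std()
    X = np.vander((mids - loc) / scale, degree + 1, increasing=True)

    # Start from a least-squares fit to log counts, then refine by Newton steps
    # on the Poisson log-likelihood.
    beta = np.linalg.lstsq(X, np.log(counts + 0.5), rcond=None)[0]
    for _ in range(n_iter):
        fitted = np.exp(np.clip(X @ beta, -30.0, 30.0))
        grad = X.T @ (counts - fitted)
        hess = X.T @ (X * fitted[:, None])
        beta = beta + np.linalg.solve(hess, grad)

    # Derivative of the fitted log density with respect to z (chain rule for the
    # standardisation), evaluated at the requested points.
    deriv_coef = beta[1:] * np.arange(1, degree + 1)
    u = (z_eval - loc) / scale
    score = np.polynomial.polynomial.polyval(u, deriv_coef) / scale
    return z_eval + score


# Simulated check with a known answer: mu ~ N(0, 1) gives E{mu | z} = z / 2.
rng = np.random.default_rng(0)
mu = rng.normal(0.0, 1.0, 20000)
z = mu + rng.normal(0.0, 1.0, 20000)
est = effect_size_estimates(z, z_eval=[-2.0, 0.0, 2.0])
print(est)
```

Only the slope of $\log \hat f$ enters the estimate, so the normalizing constant of the density never has to be computed. Working with bin counts instead of a properly normalized density is therefore harmless here.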

The effect size estimates $\hat\mu_i = \hat E\{\mu_i \mid z_i\}$ are nearly 
zero for $|z_i|$ less than 2 but increase linearly outside 
of this interval. Gene 610 has the largest z-value, 
$z_{610} = 5.29$, with estimated effect size $\hat\mu_{610} = 4.11$. 
Table 2 shows the top 10 genes in order of $|z_i|$ and 
their corresponding effect sizes $\hat\mu_i$. The $\hat\mu_i$ values are 
shrunk toward the origin, but in a manner appropriate 
to the BH prior in (13), not JS. 

The necessity for shrinkage reflects selection bias: 
the top 10 genes were winners in a competition with 
6023 others; in addition to being "good" in the sense 
of having genuinely large effect sizes, they've probably 
been "lucky" in that their random measurement 
errors were directed away from zero. Regression to 
the mean is another name for the shrinkage effect. 

A wonderful fact is that Bayes estimates are immune 
to selection bias! If $\hat\mu_{610} = 4.11$ was the actual 
Bayes estimate $E\{\mu_{610} \mid z\}$ then it would not matter 
that we became interested in Gene 610 only after 
examining all 6033 z-values: 4.11 would still be our 
estimate. This may seem surprising, but it follows 
immediately from Bayes theorem, a close cousin to 
results such as "Bayes inference in a clinical trial is 
not affected by intermediate looks at the data." 
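
A small simulation makes the immunity claim tangible. The sketch below assumes a known prior $\mu \sim \mathcal{N}(0, A)$ with $A = 1$, so that the exact Bayes estimate is $zA/(A+1)$. It selects the 10 largest z-values out of 6033 in each replication, then compares the selection bias of the raw z-values with that of the Bayes estimates. The prior, the number of replications, and the variable names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
A = 1.0                        # assumed prior variance: mu ~ N(0, A)
n_cases, n_top, n_reps = 6033, 10, 200

bias_z, bias_bayes = [], []
for _ in range(n_reps):
    mu = rng.normal(0.0, np.sqrt(A), n_cases)
    z = mu + rng.normal(0.0, 1.0, n_cases)
    top = np.argsort(z)[-n_top:]               # select the 10 largest z-values
    bayes = z[top] * A / (A + 1.0)             # exact posterior mean E{mu | z}
    bias_z.append(np.mean(z[top] - mu[top]))
    bias_bayes.append(np.mean(bayes - mu[top]))

mean_bias_z = float(np.mean(bias_z))           # average overshoot of raw z
mean_bias_bayes = float(np.mean(bias_bayes))   # average overshoot of Bayes estimate
print(mean_bias_z, mean_bias_bayes)
```

The raw z-values of the selected cases overshoot their true effects by a wide margin, while the Bayes estimates stay close to unbiased even though the same selection was applied. This holds only because the prior used in the estimate is the one that generated the data.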

Any assumption of a Bayes prior is a powerful 
statement of indirect evidence. In our example it 
amounts to saying, "We have an infinite number N 
of relevant prior observations $(\mu, z)$ with $z = 5.29$, 
and for those the average value of $\mu$ is 4.11." The 
$N = \infty$ prior observations outweigh any selection 
effects in the comparatively puny current sample, 
which is another way of stating the wonderful fact. 

Of course, we usually don't have an infinite amount 
of relevant past experience. Our empirical Bayes estimate 
$\hat\mu_{610} = 4.11$ is based on just the $N = 6033$ 
observed $z_i$ values. One might ask how immune are 
empirical Bayes estimates to selection bias? This 
is the kind of important indirect-evidence question 
that I'm hoping statisticians will soon be able to 
answer. 

Table 2 
Top 10 genes, those with largest values of $|z_i|$, in the prostate 
study and their corresponding effect size estimates $\hat\mu_i$ 

Rank    Gene    z-value    $\hat\mu_i = \hat E\{\mu_i \mid z_i\}$ 
  1      610      5.29       4.11 
  2     1720      4.83       3.65 
  3      332      4.47       3.24 
  4      364     -4.42      -3.57 
  5      914      4.40       3.16 
  6     3940     -4.33      -3.52 
  7     4546     -4.29      -3.47 
  8     1068      4.25       2.99 
  9      579      4.19       2.92 
 10     4331     -4.14      -3.30 

Fig. 9. Empirical Bayes effect size estimate $\hat E\{\mu \mid z\}$ (16), prostate data of Figure 3. Dots indicate the top 10 genes, those 
with the greatest values of $|z_i|$. The top gene, $i = 610$, has $z_i = 5.29$ and estimated effect size 4.11. 

10. LEARNING FROM THE EXPERIENCE OF OTHERS 

As I said earlier, current statistical practice is dominated 
by frequentist methodology based on direct 
evidence. I don't believe this kind of single-problem 
$N = 1$ thinking, even supplemented by aggressive 
regression technology, will carry the day in an era of 
enormous data sets and large-scale inferences. The 
proper use of indirect evidence, learning from the 
experience of others, is a pressing challenge for both 
theoretical and applied statisticians. Perhaps I should 
just say that frequentists need to become better 
Bayesians. 

This doesn't let Bayesians off the hook. A "the- 
ory of everything" can be a dangerous weapon in the 
messy world of statistical applications. The tacit assumption 
of having $N = \infty$ relevant past cases available 
for any observed value of the data can lead 
to a certain reckless optimism in one's conclusions. 
Frequentism is a leaky philosophy but a good set 
of work rules. Its fundamentally conservative atti- 
tude encourages a careful examination of what can 
go wrong as well as right with statistical procedures 
and, as I've tried to say, there's no shortage of wrong 
steps possible in our new massive-data environment. 

Fisherian procedures, which I haven't talked about 
here, often provide a pleasant compromise between 
Bayesian and frequentist methodology. Maximum 
likelihood estimation in particular can be interpreted 
from both viewpoints, as a preferred way of com- 
bining evidence from different sources. Fisher's the- 
ory was developed in a small-sample direct-evidence 
framework, however, and doesn't answer the questions 
raised here. Mainly it makes me hope for a new 
generation of Fishers, Neymans, Hotellings, etc., to 
deal with 21st-century problems. 

Empirical Bayes methods seem to me to be the 
most promising candidates for a combined 
Bayesian/frequentist attack on large-scale data anal- 
ysis problems, but they have been "promising" for 
50-plus years now, and have yet to form into a coher- 
ent theory. Most pressingly, both frequentists and 
Bayesians enjoy convincing information theories say- 
ing how well one can do in any given situation, while 
empirical Bayesians still operate on an ad hoc basis. 

This is an exciting time to be a statistician: we 
have a new class of difficult but not impossible prob- 
lems to wrestle with, which is the most any intellec- 
tual discipline can hope for. The wrestling process is 
already well underway, as witnessed in our journals 
and conferences. Like most talks that have "future" 
in the title, this one will probably seem quaint and 
limited not very long from now, but perhaps the 
discussants will have more to say about that. 